<<<<<<< HEAD Loan Data from Prosper

========================================================

This dataset contains 113,937 loans with 81 variables for each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The data set was last updated on March 11, 2014.

Univariate Plots Section

## [1] 113937     81

In this analysis, we will limit the amount of variables we analyze to 13 rather than 81 variables. The variables below were selected out of an interest to see any relationships and trends between them.

## [1] 113937     13
## 'data.frame':    113937 obs. of  13 variables:
##  $ Term                 : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus           : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerRate         : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ ProsperScore         : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ EmploymentStatus     : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ IsBorrowerHomeowner  : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CreditScoreRangeLower: int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper: int  659 699 499 819 699 759 699 719 839 839 ...
##  $ OpenCreditLines      : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ AmountDelinquent     : num  472 0 NA 10056 0 ...
##  $ DebtToIncomeRatio    : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ StatedMonthlyIncome  : num  3083 6125 2083 2875 9583 ...
##  $ MonthlyLoanPayment   : num  330 319 123 321 564 ...

I will try to order the factor levels of loan status and employment status by perceived level of benefit and risk for the loan company (from worst to best). I understand this may be subjective, but I want to see how the distributions and plots display with these variables ordered.

##  [1] "ChargedOff"             "Defaulted"             
##  [3] "Cancelled"              "Past Due (>120 days)"  
##  [5] "Past Due (91-120 days)" "Past Due (61-90 days)" 
##  [7] "Past Due (31-60 days)"  "Past Due (16-30 days)" 
##  [9] "Past Due (1-15 days)"   "FinalPaymentInProgress"
## [11] "Current"                "Completed"
## [1] "Not employed"  "No answer"     "Not available" "Other"        
## [5] "Retired"       "Self-employed" "Part-time"     "Employed"     
## [9] "Full-time"
##       Term                       LoanStatus     BorrowerRate   
##  Min.   :12.00   Current              :56576   Min.   :0.0000  
##  1st Qu.:36.00   Completed            :38074   1st Qu.:0.1340  
##  Median :36.00   Defaulted            : 5018   Median :0.1840  
##  Mean   :40.83   Past Due (1-15 days) :  806   Mean   :0.1928  
##  3rd Qu.:36.00   Past Due (31-60 days):  363   3rd Qu.:0.2500  
##  Max.   :60.00   (Other)              : 1108   Max.   :0.4975  
##                  NA's                 :11992                   
##   ProsperScore        EmploymentStatus IsBorrowerHomeowner
##  Min.   : 1.00   Employed     :67322   False:56459        
##  1st Qu.: 4.00   Full-time    :26355   True :57478        
##  Median : 6.00   Self-employed: 6134                      
##  Mean   : 5.95   Not available: 5347                      
##  3rd Qu.: 8.00   Other        : 3806                      
##  Max.   :11.00   No answer    : 2255                      
##  NA's   :29084   (Other)      : 2718                      
##  CreditScoreRangeLower CreditScoreRangeUpper OpenCreditLines
##  Min.   :  0.0         Min.   : 19.0         Min.   : 0.00  
##  1st Qu.:660.0         1st Qu.:679.0         1st Qu.: 6.00  
##  Median :680.0         Median :699.0         Median : 9.00  
##  Mean   :685.6         Mean   :704.6         Mean   : 9.26  
##  3rd Qu.:720.0         3rd Qu.:739.0         3rd Qu.:12.00  
##  Max.   :880.0         Max.   :899.0         Max.   :54.00  
##  NA's   :591           NA's   :591           NA's   :7604   
##  AmountDelinquent   DebtToIncomeRatio StatedMonthlyIncome
##  Min.   :     0.0   Min.   : 0.000    Min.   :      0    
##  1st Qu.:     0.0   1st Qu.: 0.140    1st Qu.:   3200    
##  Median :     0.0   Median : 0.220    Median :   4667    
##  Mean   :   984.5   Mean   : 0.276    Mean   :   5608    
##  3rd Qu.:     0.0   3rd Qu.: 0.320    3rd Qu.:   6825    
##  Max.   :463881.0   Max.   :10.010    Max.   :1750003    
##  NA's   :7622       NA's   :8554                         
##  MonthlyLoanPayment
##  Min.   :   0.0    
##  1st Qu.: 131.6    
##  Median : 217.7    
##  Mean   : 272.5    
##  3rd Qu.: 371.6    
##  Max.   :2251.5    
## 

## 
##    12    36    60 
##  1614 87778 24545

There are only three distinct values to the loan term (12, 36, and 60 months). Going forward, we will treat the Term variable as a factor due to its categorical nature.

We can clearly see that most loans have a term of 36 months, and it’s much less common for a loan to have a term of 12 months.

## 
##             ChargedOff              Defaulted              Cancelled 
##                      0                   5018                      5 
##   Past Due (>120 days) Past Due (91-120 days)  Past Due (61-90 days) 
##                     16                    304                    313 
##  Past Due (31-60 days)  Past Due (16-30 days)   Past Due (1-15 days) 
##                    363                    265                    806 
## FinalPaymentInProgress                Current              Completed 
##                    205                  56576                  38074

Most of the loans are either Current or Completed, but there are also many that are Charged off or Defaulted. Of the loans that are Past Due, the highest number are past due by 1-15 days. There are hardly any loans that are categorized as > 120 days past due. My theory is that this is due to the loans being considered Charged off shortly after they become > 120 days past due.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2251.5

Monthly loan payment is skewed to the right, with a minimum of $0 and a maximum of $2251.50. The minimum of $0 is interesting, as I expected all monthly loan payments to be greater than $0. I want to take a closer look at loans with monthly payments of $0 and those with high monthly payments.

##  Term                      LoanStatus   BorrowerRate     ProsperScore   
##  12:161   Completed             :800   Min.   :0.0500   Min.   : 1.000  
##  36:774   Defaulted             :131   1st Qu.:0.1224   1st Qu.: 6.000  
##  60:  0   FinalPaymentInProgress:  4   Median :0.1875   Median : 8.000  
##           ChargedOff            :  0   Mean   :0.1943   Mean   : 7.605  
##           Cancelled             :  0   3rd Qu.:0.2635   3rd Qu.: 9.000  
##           Past Due (>120 days)  :  0   Max.   :0.3600   Max.   :10.000  
##           (Other)               :  0                    NA's   :451     
##       EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
##  Full-time    :444     False:558           Min.   :  0.0        
##  Employed     :269     True :377           1st Qu.:580.0        
##  Not available: 79                         Median :660.0        
##  No answer    : 39                         Mean   :652.7        
##  Self-employed: 35                         3rd Qu.:720.0        
##  Part-time    : 35                         Max.   :860.0        
##  (Other)      : 34                         NA's   :11           
##  CreditScoreRangeUpper OpenCreditLines  AmountDelinquent 
##  Min.   : 19.0         Min.   : 0.000   Min.   :    0.0  
##  1st Qu.:599.0         1st Qu.: 4.000   1st Qu.:    0.0  
##  Median :679.0         Median : 7.000   Median :    0.0  
##  Mean   :671.7         Mean   : 7.543   Mean   :  878.4  
##  3rd Qu.:739.0         3rd Qu.:10.000   3rd Qu.:   20.0  
##  Max.   :879.0         Max.   :27.000   Max.   :81723.0  
##  NA's   :11            NA's   :118      NA's   :118      
##  DebtToIncomeRatio StatedMonthlyIncome MonthlyLoanPayment
##  Min.   : 0.0100   Min.   :     0      Min.   :0         
##  1st Qu.: 0.1000   1st Qu.:  2750      1st Qu.:0         
##  Median : 0.1700   Median :  4110      Median :0         
##  Mean   : 0.2441   Mean   :  4836      Mean   :0         
##  3rd Qu.: 0.2800   3rd Qu.:  5944      3rd Qu.:0         
##  Max.   :10.0100   Max.   :108333      Max.   :0         
##  NA's   :64

Subsetting the data to view loans with monthly payments of $0, we can see that categories such as CreditScoreRangeLower and StatedMonthlyIncome are lower than the overall median. Interestingly, the ProsperScore is higher than average for this subset.

##  Term                    LoanStatus  BorrowerRate      ProsperScore  
##  12:8   Completed             :7    Min.   :0.05650   Min.   : 3.00  
##  36:0   FinalPaymentInProgress:1    1st Qu.:0.06248   1st Qu.: 7.25  
##  60:0   ChargedOff            :0    Median :0.07715   Median : 8.00  
##         Defaulted             :0    Mean   :0.09001   Mean   : 7.50  
##         Cancelled             :0    3rd Qu.:0.11865   3rd Qu.: 8.50  
##         Past Due (>120 days)  :0    Max.   :0.14580   Max.   :10.00  
##         (Other)               :0                                     
##       EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
##  Employed     :8       False:4             Min.   :740.0        
##  Not employed :0       True :4             1st Qu.:755.0        
##  No answer    :0                           Median :760.0        
##  Not available:0                           Mean   :767.5        
##  Other        :0                           3rd Qu.:780.0        
##  Retired      :0                           Max.   :820.0        
##  (Other)      :0                                                
##  CreditScoreRangeUpper OpenCreditLines AmountDelinquent DebtToIncomeRatio
##  Min.   :759.0         Min.   : 6.00   Min.   :0        Min.   :0.1100   
##  1st Qu.:774.0         1st Qu.: 7.75   1st Qu.:0        1st Qu.:0.1625   
##  Median :779.0         Median : 9.50   Median :0        Median :0.1800   
##  Mean   :786.5         Mean   :11.00   Mean   :0        Mean   :0.2137   
##  3rd Qu.:799.0         3rd Qu.:13.25   3rd Qu.:0        3rd Qu.:0.2550   
##  Max.   :839.0         Max.   :21.00   Max.   :0        Max.   :0.3900   
##                                                                          
##  StatedMonthlyIncome MonthlyLoanPayment
##  Min.   : 7292       Min.   :2112      
##  1st Qu.:12333       1st Qu.:2144      
##  Median :18125       Median :2159      
##  Mean   :16589       Mean   :2170      
##  3rd Qu.:20208       3rd Qu.:2189      
##  Max.   :25000       Max.   :2252      
## 

Subsetting the data to view loans with monthly payments over $2000, it seems that the borrower generally has a high stated monthly income, has no amount delinquent, is employed, and has a high credit score. Also, the terms are all 12 months and the borrower rates are all under 0.15. The borrowers in these loans appear to be “low risk” borrowers as the higher median Prosper Score indicates.

The Prosper Score appears to be a normal distribution. The plot above excludes any ‘NA’ values. The Prosper Score is also supposed to range from 1-10, so the scores of 11 were unexpected. I’m hesitant to treat a score of 11 as the best score since it may be an entry error. Let us subset the data to see if there is a trend in the loans with a Prosper Score of 11 and if the data is trustworthy.

##  Term                       LoanStatus    BorrowerRate      ProsperScore
##  12:   0   Current               :1419   Min.   :0.06050   Min.   :11   
##  36:1049   Completed             :  31   1st Qu.:0.06590   1st Qu.:11   
##  60: 407   FinalPaymentInProgress:   3   Median :0.08690   Median :11   
##            Defaulted             :   1   Mean   :0.09328   Mean   :11   
##            Past Due (61-90 days) :   1   3rd Qu.:0.10990   3rd Qu.:11   
##            Past Due (1-15 days)  :   1   Max.   :0.19500   Max.   :11   
##            (Other)               :   0                                  
##       EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
##  Employed     :1365    False:483           Min.   :640          
##  Self-employed:  43    True :973           1st Qu.:700          
##  Other        :  34                        Median :720          
##  Full-time    :  14                        Mean   :732          
##  Not employed :   0                        3rd Qu.:760          
##  No answer    :   0                        Max.   :840          
##  (Other)      :   0                                             
##  CreditScoreRangeUpper OpenCreditLines AmountDelinquent DebtToIncomeRatio
##  Min.   :659           Min.   : 2.00   Min.   :     0   Min.   :0.0200   
##  1st Qu.:719           1st Qu.: 8.00   1st Qu.:     0   1st Qu.:0.1300   
##  Median :739           Median :11.00   Median :     0   Median :0.1900   
##  Mean   :751           Mean   :11.32   Mean   :   298   Mean   :0.2007   
##  3rd Qu.:779           3rd Qu.:14.00   3rd Qu.:     0   3rd Qu.:0.2500   
##  Max.   :859           Max.   :38.00   Max.   :106071   Max.   :0.5800   
##                                                         NA's   :44       
##  StatedMonthlyIncome MonthlyLoanPayment
##  Min.   : 1200       Min.   :  43.47   
##  1st Qu.: 5417       1st Qu.: 243.56   
##  Median : 7500       Median : 402.31   
##  Mean   : 8850       Mean   : 424.04   
##  3rd Qu.:10833       3rd Qu.: 581.40   
##  Max.   :76667       Max.   :1162.33   
## 

The summary of loans with a Prosper Score of 11 do seem to indicate borrowers that are low risk. The median StatedMonthlyIncome and OpenCreditLines are higher than the overall median. I think this data can be trusted and used.

Borrower Rate has a fairly normal distribution, but there is a second peak around 0.31. The Prosper Score and Borrower Rate are both generated values, and I’m interested in exploring which variables affect these values.

Almost all of the borrowers are employed. A limitation of this dataset that can be recognized here is that the data is only for those who applied for a loan and were approved. The “riskier”" loan applicants (unemployed) that were denied loans are not a part of this dataset.

I also wonder if leaving the response blank, or the unclear responses of “Not available” and “Other” affect the borrower’s interest rate and Prosper score.

There is an even split between homeowners and non-homeowners. Does being a homeowner affect the Prosper Score? How about the borrower rate?

The stated monthly income is skewed to the right, but the log transformation produces a more normal distribution. There are very large outliers in the stated monthly income data so the top 1% of data was trimmed for the plot.

Similarly, a majority of the Debt to Income Ratios were between 0 and 1 with some values at the max of 10.01. The top 1% of data was trimmed for the plot above, showing a fairly normal distribution with some skew to the right. I want to take a look at the highest values of debt to income ratio.

##  Term                     LoanStatus   BorrowerRate     ProsperScore  
##  12:  1   Completed            :124   Min.   :0.0100   Min.   :1.000  
##  36:263   Defaulted            : 42   1st Qu.:0.1497   1st Qu.:3.000  
##  60:  8   Current              : 23   Median :0.1900   Median :4.500  
##           Past Due (61-90 days):  1   Mean   :0.2013   Mean   :4.478  
##           Past Due (16-30 days):  1   3rd Qu.:0.2699   3rd Qu.:6.000  
##           (Other)              :  0   Max.   :0.3500   Max.   :8.000  
##           NA's                 : 81                    NA's   :226    
##       EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
##  Self-employed:74      False:145           Min.   :460.0        
##  Full-time    :65      True :127           1st Qu.:620.0        
##  Not available:37                          Median :680.0        
##  Employed     :36                          Mean   :668.8        
##  Not employed :27                          3rd Qu.:720.0        
##  Part-time    :17                          Max.   :860.0        
##  (Other)      :16                                               
##  CreditScoreRangeUpper OpenCreditLines  AmountDelinquent 
##  Min.   :479.0         Min.   : 0.000   Min.   :    0.0  
##  1st Qu.:639.0         1st Qu.: 5.000   1st Qu.:    0.0  
##  Median :699.0         Median : 8.000   Median :    0.0  
##  Mean   :687.8         Mean   : 8.694   Mean   :  734.9  
##  3rd Qu.:739.0         3rd Qu.:12.000   3rd Qu.:    0.0  
##  Max.   :879.0         Max.   :30.000   Max.   :37077.0  
##                        NA's   :43       NA's   :43       
##  DebtToIncomeRatio StatedMonthlyIncome MonthlyLoanPayment
##  Min.   :10.01     Min.   :    0.000   Min.   :   0.0    
##  1st Qu.:10.01     1st Qu.:    0.083   1st Qu.: 107.2    
##  Median :10.01     Median :    0.083   Median : 209.5    
##  Mean   :10.01     Mean   :  133.299   Mean   : 301.9    
##  3rd Qu.:10.01     3rd Qu.:    0.396   3rd Qu.: 388.6    
##  Max.   :10.01     Max.   :17083.333   Max.   :1047.6    
## 

Subsetting the data by Debt to Income Ratios greater than 10, we begin to see riskier Prosper scores, more Charged off or Defaulted loans, and much lower levels of full-time employment and stated monthly incomes.

## 
##     0   360   420   440   460   480   500   520   540   560   580   600 
##   133     1     5    36   141   346   554  1593  1474  1357  1125  3602 
##   620   640   660   680   700   720   740   760   780   800   820   840 
##  4172 12199 16366 16492 15471 12923  9267  6606  4624  2644  1409   567 
##   860   880 
##   212    27
## 
##    19   379   439   459   479   499   519   539   559   579   599   619 
##   133     1     5    36   141   346   554  1593  1474  1357  1125  3602 
##   639   659   679   699   719   739   759   779   799   819   839   859 
##  4172 12199 16366 16492 15471 12923  9267  6606  4624  2644  1409   567 
##   879   899 
##   212    27

The credit score variables both have normal distributions that look identical, but shifted on the x-axis. Looking at the tables of the values, we can see that the lower credit scores all end in 0 and the upper scores all end in 9. The scores also increment by a factor of 20 and the upper credit scores are all higher than the lower scores by exactly 19. Since these values represent a credit score range that seems to vary by exactly 19 for each borrower, we will create a new variable CreditScore that is the mean of CreditScoreRangeLower and CreditScoreRangeUpper.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

According to ValuePenguin, the average credit score in the U.S. in 2014 was 693. Here we can see that the mean (695) and median (690) of the CreditScore in our data set are right around this value. I’m surprised to see here that the credit scores for those approved for loans is not higher than the average in the U.S. in 2014.

We can also see here that credit scores in our data set range from 9.5 to 889.5, but the actual range for credit scores is 300 to 850. Let’s explore some loans with credit scores outside the normal range.

##       Term LoanStatus BorrowerRate ProsperScore EmploymentStatus
## 795     36  Completed       0.2900           NA    Not available
## 913     36  Defaulted       0.2600           NA    Not available
## 1686    36       <NA>       0.1625           NA    Not available
## 4291    36       <NA>       0.2375           NA        No answer
## 4372    36  Defaulted       0.2325           NA    Not available
## 4431    36  Defaulted       0.1450           NA    Not available
## 4673    36       <NA>       0.1930           NA    Not available
## 5685    36  Completed       0.2300           NA    Not available
## 6878    36       <NA>       0.0700           NA    Not available
## 7740    36       <NA>       0.2900           NA    Not available
## 9049    36  Defaulted       0.2900           NA    Not available
## 9474    36       <NA>       0.2500           NA    Not available
## 13396   36  Defaulted       0.2375           NA        No answer
## 13526   36       <NA>       0.2000           NA    Not available
## 13687   36  Completed       0.1700           NA    Not available
## 13939   36  Defaulted       0.1700           NA    Not available
## 14339   36       <NA>       0.1700           NA    Not available
## 14726   36       <NA>       0.1700           NA    Not available
## 15094   36  Defaulted       0.2900           NA        No answer
## 15138   36       <NA>       0.2500           NA        No answer
##       IsBorrowerHomeowner CreditScoreRangeLower CreditScoreRangeUpper
## 795                 False                     0                    19
## 913                 False                     0                    19
## 1686                False                     0                    19
## 4291                False                     0                    19
## 4372                False                     0                    19
## 4431                False                     0                    19
## 4673                False                     0                    19
## 5685                False                     0                    19
## 6878                False                     0                    19
## 7740                False                     0                    19
## 9049                False                     0                    19
## 9474                False                     0                    19
## 13396                True                     0                    19
## 13526               False                     0                    19
## 13687               False                     0                    19
## 13939               False                     0                    19
## 14339               False                     0                    19
## 14726               False                     0                    19
## 15094               False                     0                    19
## 15138               False                     0                    19
##       OpenCreditLines AmountDelinquent DebtToIncomeRatio
## 795                NA               NA                NA
## 913                NA               NA           0.03000
## 1686               NA               NA                NA
## 4291               NA               NA           0.10368
## 4372               NA               NA                NA
## 4431               NA               NA                NA
## 4673               NA               NA           0.02000
## 5685               NA               NA           0.09000
## 6878               NA               NA                NA
## 7740               NA               NA           0.07000
## 9049               NA               NA           0.01000
## 9474               NA               NA           0.05000
## 13396              NA               NA           0.02393
## 13526              NA               NA                NA
## 13687              NA               NA                NA
## 13939              NA               NA           0.02000
## 14339              NA               NA                NA
## 14726              NA               NA           0.13000
## 15094              NA               NA           0.14000
## 15138              NA               NA                NA
##       StatedMonthlyIncome MonthlyLoanPayment CreditScore
## 795              115.5833             167.62         9.5
## 913             2316.6667              60.44         9.5
## 1686            1600.0000              35.28         9.5
## 4291            2916.6667             156.41         9.5
## 4372               0.0000              58.26         9.5
## 4431             166.6667              34.42         9.5
## 4673            2666.6667              44.17         9.5
## 5685            1083.3333              98.71         9.5
## 6878            2227.5000              30.88         9.5
## 7740            1400.0000             104.76         9.5
## 9049            3166.6667              41.91         9.5
## 9474            2000.0000             101.39         9.5
## 13396           4166.6667              99.71         9.5
## 13526           3896.6667             111.49         9.5
## 13687           2050.0000              35.65         9.5
## 13939           2083.3333              35.65         9.5
## 14339           1083.3333              42.78         9.5
## 14726            916.6667              57.04         9.5
## 15094           1200.0000              83.81         9.5
## 15138           1500.0000             119.28         9.5
##       Term LoanStatus BorrowerRate ProsperScore EmploymentStatus
## 186     36  Defaulted       0.2365            6    Self-employed
## 936     36  Completed       0.0499           NA        Full-time
## 1094    36    Current       0.0759            8         Employed
## 1402    36  Completed       0.0820           NA    Not available
## 1576    36    Current       0.0625           10         Employed
## 2024    36  Completed       0.3199            3    Self-employed
## 2117    36  Completed       0.0750           NA    Self-employed
## 2987    36  Completed       0.0695           NA          Retired
## 4083    36  Completed       0.0945            8        Full-time
## 4550    36  Completed       0.0854           10         Employed
## 6521    60    Current       0.0779           10         Employed
## 7183    36  Completed       0.0590           NA        Full-time
## 7449    12  Completed       0.0628            9          Retired
## 7475    36  Completed       0.1390            8         Employed
## 7835    36    Current       0.0649           10         Employed
## 8075    36  Completed       0.0830            8        Full-time
## 8731    36  Completed       0.0599           10         Employed
## 8852    36  Completed       0.0825           10         Employed
## 8993    36    Current       0.2049            7         Employed
## 10160   36  Completed       0.1559            5    Self-employed
##       IsBorrowerHomeowner CreditScoreRangeLower CreditScoreRangeUpper
## 186                  True                   860                   879
## 936                  True                   860                   879
## 1094                 True                   860                   879
## 1402                 True                   860                   879
## 1576                 True                   860                   879
## 2024                 True                   860                   879
## 2117                 True                   860                   879
## 2987                 True                   860                   879
## 4083                 True                   860                   879
## 4550                 True                   860                   879
## 6521                 True                   860                   879
## 7183                 True                   860                   879
## 7449                 True                   860                   879
## 7475                 True                   860                   879
## 7835                 True                   860                   879
## 8075                 True                   880                   899
## 8731                 True                   860                   879
## 8852                 True                   860                   879
## 8993                 True                   880                   899
## 10160                True                   860                   879
##       OpenCreditLines AmountDelinquent DebtToIncomeRatio
## 186                 5                0                NA
## 936                 4                0              0.00
## 1094                6                0              0.17
## 1402               NA               NA              0.02
## 1576                8                0              0.03
## 2024                7                0                NA
## 2117                7                0              0.03
## 2987                8                0              0.05
## 4083               12                0              0.15
## 4550                4                0              0.02
## 6521                7                0              0.08
## 7183               15                0              0.14
## 7449               13                0              0.13
## 7475               14                0              0.13
## 7835                7                0              0.06
## 8075                8                0                NA
## 8731                7                0              0.05
## 8852                5                0              0.03
## 8993                7                0              0.26
## 10160              10                0                NA
##       StatedMonthlyIncome MonthlyLoanPayment CreditScore
## 186              3333.333             234.30       869.5
## 936              9000.000              29.97       869.5
## 1094             6833.333             186.89       869.5
## 1402             5000.000              94.29       869.5
## 1576            17916.667             213.75       869.5
## 2024             6250.000             108.87       869.5
## 2117            18333.333              31.11       869.5
## 2987             1258.000              46.28       869.5
## 4083            10583.333             640.19       869.5
## 4550             8108.333              94.76       869.5
## 6521             8666.667             282.46       869.5
## 7183             7500.000             121.51       869.5
## 7449            10000.000             172.39       869.5
## 7475            32083.333             563.13       869.5
## 7835             8333.333             245.16       869.5
## 8075             6250.000             157.37       889.5
## 8731             7291.667             243.34       869.5
## 8852             9333.333             235.89       869.5
## 8993             2333.333             149.65       889.5
## 10160           10000.000             524.32       869.5

These loans do seem to be missing a fair amount of values (there are a lot of Prosper scores missing), but since the data here do contain values for many of the variables I will not remove these rows. I will remember to subset the data for any plots or calculations involving credit score to only include scores from 300 to 850.

The number of open credit lines is a fairly normal distribution. Most people have less than 20 open credit lines. The distribution of amount delinquent is, alternatively, clearly skewed to the right. Most people are not deliquent at all. A log transformation of the AmountDelinquent data produces a normal distribution.

Univariate Analysis

What is the structure of your dataset?

The data set being analyzed contains 113,937 loans with 13 variables (Term, LoanStatus, MonthlyLoanPayment, ProsperScore, BorrowerRate, EmploymentStatus, IsBorrowerHomeowner, StatedMonthlyIncome, DebtToIncomeRatio, CreditScoreRangeLower, CreditScoreRangeUpper, OpenCreditLines, and AmountDelinquent).

The original data set contained 81 variables, but in this project we are focusing on 13 variables of interest.

Some initial observations from the data are: - Most loans have a 36 month term - Borrower rate peaks around 0.15, but also again around 0.31 - The median upper credit score is about the same as the average U.S credit score - There are extreme values in the debt to income ratio at 10.01 - Some monthly loan payments are much higher than average - Some Prosper scores are an 11, outside the scale of 1-10

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is ProsperScore. I want to explore how the borrower profile such as EmploymentStatus, CreditScore, and StatedMonthlyIncome affect the ProsperScore. Also, I want to see how ProsperScore relates to the loan details such as BorrowerRate, MonthlyLoanPayment, and LoanStatus.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

As stated above, I think the variables relating to the borrower’s profile will all contribute to the ProsperScore. I think the CreditScore, StatedMonthlyIncome, and DebtToIncomeRatio will contribute the most. Further, I think the ProsperScore will correlate with the BorrowerRate and MonthlyLoanPayment, and will be predictive of the LoanStatus.

Did you create any new variables from existing variables in the dataset?

I created CreditScore by taking the average of the upper and lower credit scores, and I made Term a factor variable since the numerical values fell within 3 distinct categories. I also ordered the EmploymentStatus and LoanStatus factors based on my perception of best to worst from the view of the loan company.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I performed a log transformation on the StatedMonthlyIncome and AmountDelinquent since the data in both plots was skewed to the right. I wanted to see if the log transformation would normalize the data. The transformed data did create a more normal looking distribution for both variables.

Bivariate Plots Section

##                     BorrowerRate ProsperScore OpenCreditLines
## BorrowerRate                1.00        -0.66           -0.11
## ProsperScore               -0.66         1.00           -0.01
## OpenCreditLines            -0.11        -0.01            1.00
## AmountDelinquent            0.05        -0.04           -0.06
## DebtToIncomeRatio           0.13        -0.14            0.15
## StatedMonthlyIncome        -0.15         0.16            0.19
## MonthlyLoanPayment         -0.33         0.18            0.19
## CreditScore                -0.53         0.38            0.10
##                     AmountDelinquent DebtToIncomeRatio StatedMonthlyIncome
## BorrowerRate                    0.05              0.13               -0.15
## ProsperScore                   -0.04             -0.14                0.16
## OpenCreditLines                -0.06              0.15                0.19
## AmountDelinquent                1.00             -0.03                0.02
## DebtToIncomeRatio              -0.03              1.00               -0.14
## StatedMonthlyIncome             0.02             -0.14                1.00
## MonthlyLoanPayment             -0.02              0.00                0.30
## CreditScore                    -0.05             -0.01                0.10
##                     MonthlyLoanPayment CreditScore
## BorrowerRate                     -0.33       -0.53
## ProsperScore                      0.18        0.38
## OpenCreditLines                   0.19        0.10
## AmountDelinquent                 -0.02       -0.05
## DebtToIncomeRatio                 0.00       -0.01
## StatedMonthlyIncome               0.30        0.10
## MonthlyLoanPayment                1.00        0.22
## CreditScore                       0.22        1.00

Borrower rate has a moderate (to strong) negative correlation with Prosper score and credit score. There are also weak relationships between some of the other variables. As credit score goes up, Prosper score tends to go up. A higher monthly loan payment also weakly correlates with a higher stated monthly income and a lower borrower rate. Borrower rate seems to have moderately strong correlations with the most variables.

The Prosper score seems to be moderately correlated with borrower rate (negatively) and credit score. Credit score also seems to be moderately correlated with monthly loan payment and negatively with borrower rate.

There don’t seem to be any obvious relationships between Prosper score and any of the categorical variables. I also included borrower rate to see if any of the categorical variables correlated with it, but the relationships are weak.

I want to take a closer look at the relationships between Prosper score and the variables borrower rate, monthly loan payment, credit score, and debt to income ratio.

As Prosper score increases, borrower rate tends to decrease. At the low end and high end of the x-axis the range in borrower rate is lower. In the middle of the plot (Prosper score of 4-8) we can see a much bigger range in the borrower rate.

The slight upward trend in monthly loan payment as Prosper score increases can be seen here. It’s also fairly clear that as Prosper score increases, the biggest change to monthly loan payments is a wider range and greater volume of values. We don’t see many monthly loan payments greater than $500 for Prosper scores of 1-3, but the distribution of monthly loan payments for Prosper scores greater than 3 do not differ much as the score increases. Basically, borrowers with low Prosper scores do not get loans with high monthly payments.

Similarly to the borrower rate, a low Prosper score typically corresponds to a low credit score and a high Prosper score with a high credit score, but the middle range of Prosper scores correspond to a wide range of credit scores. In other words, it’s unlikely for a borrower with a low credit score to have a high Prosper score, and vice versa for a borrower with a high credit score. If the borrower has a middle range credit score (650-750), their Prosper score can range vastly.

Here we can see the slight negative correlation between debt to income ratio and Prosper score.

Let us also take a look at borrower rate vs. monthly loan payment and credit score as these variables had moderate correlations with borrower rate.

Changing the plot limits and adding transparency reveals clear sections of the plot where borrower rate increases linearly as the monthly loan payment increases. Also, the highest borrower rates tend to correspond to lower monthly payments. I wonder what may cause the borrower rate to increase without much or any increase to monthly loan payment.

The correlation between credit score and borrower rate is clear here - as credit score increases, borrower rate decreases.

Next, I’ll look at how Prosper score and borrower rate may fluctuate based on the categorical variables (loan status, employment, and homeowners). It is important to remember here that the loan status and employment status levels were ordered by my own thoughts on levels of borrower risk and ideal loan outcomes.

As mentioned above, it’s important to remember that the order of these levels was determined by my own judgment. There is a slight upward trend in Prosper score across loan status, but the defaulted loans Prosper score values seem to throw off this trend. There also may be a slight downward trend in borrower rate across loan status, but it is not strong at all.

Lower average Prosper scores can be seen for borrowers with employment statuses of not employed, other, and self-employed, while higher average scores correspond with retired, part-time and full-time. Based on this, I will begin treating employment status as a nominal variable. Trying to order the employment status categories does not seem feasible any longer.

The borrower rate is generally higher for those that are not employed, but otherwise there isn’t much difference across employment levels.

The borrower’s status as a homeowner does not appear to affect the Prosper score at all, but not owning a home does seem to correspond with a higher borrower rate.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The Prosper score and borrower rate had a moderate negative correlation, with the borrower rate decreasing as Prosper score increased. There was a big range of values in the middle section of Prosper scores (and the middle section of borrower rates).

Low Prosper scores make it highly unlikely that the monthly loan payment will be greater than $500.

A well-below average credit score generally corresponds to a low Prosper score, and a well-above average credit score generally corresponds to a high Prosper score. A credit score between 650-750, on the other hand, corresponds to Prosper scores that can range vastly.

Debt to income ratio increases tended to weakly correlate with lower Prosper scores.

There was a slight upward trend in Prosper score across loan status, but the defaulted loans Prosper score values seem to throw off this trend.

Lower average Prosper scores can be seen for borrowers with employment statuses of not employed, other, and self-employed, while higher average scores correspond with retired, part-time and full-time.

It was surprising to see that being a homeowner had no affect on Prosper score.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Borrower rate tends to decrease as credit score increases, which makes sense. I was also very interested in the plot between borrower rate and monthly loan payment. There were clear segments that showed sectional linear relationships between borrower rate and monthly loan payment. I’m wondering if the distinction of these sections corresponds to any other variables, or if the linear relationships are affected by another variable.

What was the strongest relationship you found?

The strongest bivariate relationship was between borrower rate and Prosper score. Credit score also had a fairly strong correlation with borrower rate.

Multivariate Plots Section

First I want to investigate the relationship above between borrower rate and monthly loan payment against the categorical variables and against Prosper score. This relationship really caught my eye and I want to see if there is more to learn about it.

It’s hard to pull any trends out of the data here. Most of the data contains borrowers who are employed or full-time. For loans with low borrower rates and mid-to-high monthly loan payments, its most common for the borrower to be employed.

Most loans are current, completed, or defaulted, but its hard to see any obvious trends.

Again, there is no obvious relationship here - homeowners may be slightly more likely to have higher monthly payments and lower borrower rates.

Here there is definitely a relationship, with borrower rate increasing as the Prosper score decreases while holding the monthly loan payment constant. It seems that the linear relationship pieces of the plot noted previously seem to correspond to decreases in Prosper score.

Let us move on and look at Prosper score and borrower rate against the categorical variables.

There are no new trends that can be observed from these plots.

Next I want to look at borrower rate and credit score against the categorical variables and also against Prosper score.

Low credit scores tend to correspond to defaulted loans as well as employment statuses of not employed or no answer.

High credit scores tend to correspond to homeowners, while low credit scores correspond to non-homeowners.

Borrowers with a higher Prosper score tend to have lower borrower rates, even if we hold credit score constant.

At first I had wanted to start out by exploring the Prosper score, but the data exploration has made me more interested in predicting the borrower rate. The plots colored by Prosper score show the moderately strong relationships borrower rate has with other variables in the dataset. I want to create a linear model to see how well we can predict borrower rate.

## 
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loans_subset_CreditScore)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScore, data = loans_subset_CreditScore)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + MonthlyLoanPayment, 
##     data = loans_subset_CreditScore)
## 
## ======================================================================
##                             m1              m2              m3        
## ----------------------------------------------------------------------
##   (Intercept)               0.317***        0.648***        0.630***  
##                            (0.001)         (0.003)         (0.003)    
##   ProsperScore             -0.020***       -0.017***       -0.016***  
##                            (0.000)         (0.000)         (0.000)    
##   CreditScore                              -0.000***       -0.000***  
##                                            (0.000)         (0.000)    
##   MonthlyLoanPayment                                       -0.000***  
##                                                            (0.000)    
## ----------------------------------------------------------------------
##   R-squared                 0.421           0.504           0.535     
##   adj. R-squared            0.421           0.504           0.535     
##   sigma                     0.057           0.052           0.051     
##   F                     61498.813       43106.032       32531.390     
##   p                         0.000           0.000           0.000     
##   Log-likelihood       122853.694      129468.834      132199.794     
##   Deviance                272.796         233.352         218.782     
##   AIC                 -245701.389     -258929.668     -264389.588     
##   BIC                 -245673.348     -258892.280     -264342.853     
##   N                     84714           84714           84714         
## ======================================================================

The model accounts for 53.5% of the variance in the borrower rate for a loan.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Holding monthly loan payments constant, higher Prosper scores generally equate to lower borrower rates. The same relationship can be seen when we hold credit scores constant.

These plots along with the correlations seen in the bivariate plots section led me to create a linear regression model with these variables.

Were there any interesting or surprising interactions between features?

I was mostly surprised to see the absence of any obvious trends across loan status, employment status, and homeownership (other than credit score).

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

For the amount of variables used in this exploration (compared to the 81 overall variables in the full data set), I think the linear model above is fairly good. With the complexities of all of the factors that go into a loan interest rate, I was surprised to see that Prosper score, credit score, and monthly loan payment alone account for 53.5% of the variance in the borrower rate.

The model obviously has its limitations since there are many other variables that likely affect affect borrower rate in the full data set (and outside it). The date the loan was taken out, the borrower’s state of residence, the type of loan, and many other variables likely have effects on the borrower rate. This is definitely not a perfect model, but I do find it relevant to see that three variables alone can account for over half of the variance in a loan rate.


Final Plots and Summary

Plot One

Description One

The borrower’s interest rate distribution is fairly normal, but contains an interesting peak around 0.31. The loan company may approve some ‘riskier’ borrowers for loans and give them higher interest rates in an attempt to make more money.

Plot Two

Description Two

Borrower rate and monthly loan payment have a very interesting relationship as seen here. There are clear linear trends within the plot. At lower monthly loan payments, the borrower rate can rise quickly for the same loan payment. As the monthly loan payments go up, borrower rate and loan payment increase together.

Plot Three

Description Three

Prosper score gives a better understanding to the relationship between borrower rate and monthly loan payment. Keeping the monthly loan payment constant, the borrower rate increases as Prosper score decreases (indicating a higher risk borrower). As the monthly loan payment increases, the loan company is taking on a higher risk if the borrower does not pay the money back since they will lose more money. For this reason, the loan company increases both the borrower rate and monthly loan payment as the Prosper score decreases. This way, the loan company is able to earn more interest from the borrower if they do pay the loan back. The extra revenue from interest makes it worth it for them to approve the loan with a higher loan payment for riskier borrowers.


Reflection

The full dataset used here contains almost 114,000 loans with 81 variables. There is a ton of information in the dataset, and in this exploration I decided to focus on 13 variables that peaked my interest.

I began by exploring the individual distributions of the variables in the dataset. Then, I looked at relationships between the variables and further explored the strongest trends. The exploration culminated in an interest in creating a linear model to predict borrower rate based on three of the variables.

Although I started off most interested in Prosper score, the trends among the variables I chose to explore made borrower rate the more intriguing variable. By choosing only 13 variables to explore from the full dataset, I absolutely introduced some limitations to the model. There are many other factors that I did not explore relating to the borrower rate, which is the reason the model only accounts for 53.5% of the variance in borrower rate.

Other limitations include the fact that this data all comes from Prosper. The relationships and models that come from this dataset may be totally different for loans at another company. The dataset also hasn’t been updated since 2014. Loan rates change frequently and so this data may not properly represent the current state of loans.

As mentioned above, there are 81 variables in the full dataset for each loan. I’d be interested in exploring the other variables in depth as well to see which other variables may contribute to borrower rate. With the full dataset, I also think a model could be built to predict the Prosper score. It would be interesting to see which factors go into the generation of a Prosper score and if a model could be built that could accurately predict what a borrower’s Prosper score would be.

======= Loan Data from Prosper

========================================================

This dataset contains 113,937 loans with 81 variables for each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information. The data set was last updated on March 11, 2014.

Univariate Plots Section

## [1] 113937     81

In this analysis, we will limit the amount of variables we analyze to 13 rather than 81 variables. The variables below were selected out of an interest to see any relationships and trends between them.

## [1] 113937     13
## 'data.frame':    113937 obs. of  13 variables:
##  $ Term                 : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus           : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerRate         : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ ProsperScore         : num  NA 7 NA 9 4 10 2 4 9 11 ...
##  $ EmploymentStatus     : Factor w/ 9 levels "","Employed",..: 9 2 4 2 2 2 2 2 2 2 ...
##  $ IsBorrowerHomeowner  : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ CreditScoreRangeLower: int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper: int  659 699 499 819 699 759 699 719 839 839 ...
##  $ OpenCreditLines      : int  4 14 NA 5 19 17 7 6 16 16 ...
##  $ AmountDelinquent     : num  472 0 NA 10056 0 ...
##  $ DebtToIncomeRatio    : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ StatedMonthlyIncome  : num  3083 6125 2083 2875 9583 ...
##  $ MonthlyLoanPayment   : num  330 319 123 321 564 ...

I will try to order the factor levels of loan status and employment status by perceived level of benefit and risk for the loan company (from worst to best). I understand this may be subjective, but I want to see how the distributions and plots display with these variables ordered.

##  [1] "ChargedOff"             "Defaulted"             
##  [3] "Cancelled"              "Past Due (>120 days)"  
##  [5] "Past Due (91-120 days)" "Past Due (61-90 days)" 
##  [7] "Past Due (31-60 days)"  "Past Due (16-30 days)" 
##  [9] "Past Due (1-15 days)"   "FinalPaymentInProgress"
## [11] "Current"                "Completed"
## [1] "Not employed"  "No answer"     "Not available" "Other"        
## [5] "Retired"       "Self-employed" "Part-time"     "Employed"     
## [9] "Full-time"
##       Term                       LoanStatus     BorrowerRate   
##  Min.   :12.00   Current              :56576   Min.   :0.0000  
##  1st Qu.:36.00   Completed            :38074   1st Qu.:0.1340  
##  Median :36.00   Defaulted            : 5018   Median :0.1840  
##  Mean   :40.83   Past Due (1-15 days) :  806   Mean   :0.1928  
##  3rd Qu.:36.00   Past Due (31-60 days):  363   3rd Qu.:0.2500  
##  Max.   :60.00   (Other)              : 1108   Max.   :0.4975  
##                  NA's                 :11992                   
##   ProsperScore        EmploymentStatus IsBorrowerHomeowner
##  Min.   : 1.00   Employed     :67322   False:56459        
##  1st Qu.: 4.00   Full-time    :26355   True :57478        
##  Median : 6.00   Self-employed: 6134                      
##  Mean   : 5.95   Not available: 5347                      
##  3rd Qu.: 8.00   Other        : 3806                      
##  Max.   :11.00   No answer    : 2255                      
##  NA's   :29084   (Other)      : 2718                      
##  CreditScoreRangeLower CreditScoreRangeUpper OpenCreditLines
##  Min.   :  0.0         Min.   : 19.0         Min.   : 0.00  
##  1st Qu.:660.0         1st Qu.:679.0         1st Qu.: 6.00  
##  Median :680.0         Median :699.0         Median : 9.00  
##  Mean   :685.6         Mean   :704.6         Mean   : 9.26  
##  3rd Qu.:720.0         3rd Qu.:739.0         3rd Qu.:12.00  
##  Max.   :880.0         Max.   :899.0         Max.   :54.00  
##  NA's   :591           NA's   :591           NA's   :7604   
##  AmountDelinquent   DebtToIncomeRatio StatedMonthlyIncome
##  Min.   :     0.0   Min.   : 0.000    Min.   :      0    
##  1st Qu.:     0.0   1st Qu.: 0.140    1st Qu.:   3200    
##  Median :     0.0   Median : 0.220    Median :   4667    
##  Mean   :   984.5   Mean   : 0.276    Mean   :   5608    
##  3rd Qu.:     0.0   3rd Qu.: 0.320    3rd Qu.:   6825    
##  Max.   :463881.0   Max.   :10.010    Max.   :1750003    
##  NA's   :7622       NA's   :8554                         
##  MonthlyLoanPayment
##  Min.   :   0.0    
##  1st Qu.: 131.6    
##  Median : 217.7    
##  Mean   : 272.5    
##  3rd Qu.: 371.6    
##  Max.   :2251.5    
## 

## 
##    12    36    60 
##  1614 87778 24545

There are only three distinct values to the loan term (12, 36, and 60 months). Going forward, we will treat the Term variable as a factor due to its categorical nature.

We can clearly see that most loans have a term of 36 months, and it’s much less common for a loan to have a term of 12 months.

## 
##             ChargedOff              Defaulted              Cancelled 
##                      0                   5018                      5 
##   Past Due (>120 days) Past Due (91-120 days)  Past Due (61-90 days) 
##                     16                    304                    313 
##  Past Due (31-60 days)  Past Due (16-30 days)   Past Due (1-15 days) 
##                    363                    265                    806 
## FinalPaymentInProgress                Current              Completed 
##                    205                  56576                  38074

Most of the loans are either Current or Completed, but there are also many that are Charged off or Defaulted. Of the loans that are Past Due, the highest number are past due by 1-15 days. There are hardly any loans that are categorized as > 120 days past due. My theory is that this is due to the loans being considered Charged off shortly after they become > 120 days past due.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   131.6   217.7   272.5   371.6  2251.5

Monthly loan payment is skewed to the right, with a minimum of $0 and a maximum of $2251.50. The minimum of $0 is interesting, as I expected all monthly loan payments to be greater than $0. I want to take a closer look at loans with monthly payments of $0 and those with high monthly payments.

##  Term                      LoanStatus   BorrowerRate     ProsperScore   
##  12:161   Completed             :800   Min.   :0.0500   Min.   : 1.000  
##  36:774   Defaulted             :131   1st Qu.:0.1224   1st Qu.: 6.000  
##  60:  0   FinalPaymentInProgress:  4   Median :0.1875   Median : 8.000  
##           ChargedOff            :  0   Mean   :0.1943   Mean   : 7.605  
##           Cancelled             :  0   3rd Qu.:0.2635   3rd Qu.: 9.000  
##           Past Due (>120 days)  :  0   Max.   :0.3600   Max.   :10.000  
##           (Other)               :  0                    NA's   :451     
##       EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
##  Full-time    :444     False:558           Min.   :  0.0        
##  Employed     :269     True :377           1st Qu.:580.0        
##  Not available: 79                         Median :660.0        
##  No answer    : 39                         Mean   :652.7        
##  Self-employed: 35                         3rd Qu.:720.0        
##  Part-time    : 35                         Max.   :860.0        
##  (Other)      : 34                         NA's   :11           
##  CreditScoreRangeUpper OpenCreditLines  AmountDelinquent 
##  Min.   : 19.0         Min.   : 0.000   Min.   :    0.0  
##  1st Qu.:599.0         1st Qu.: 4.000   1st Qu.:    0.0  
##  Median :679.0         Median : 7.000   Median :    0.0  
##  Mean   :671.7         Mean   : 7.543   Mean   :  878.4  
##  3rd Qu.:739.0         3rd Qu.:10.000   3rd Qu.:   20.0  
##  Max.   :879.0         Max.   :27.000   Max.   :81723.0  
##  NA's   :11            NA's   :118      NA's   :118      
##  DebtToIncomeRatio StatedMonthlyIncome MonthlyLoanPayment
##  Min.   : 0.0100   Min.   :     0      Min.   :0         
##  1st Qu.: 0.1000   1st Qu.:  2750      1st Qu.:0         
##  Median : 0.1700   Median :  4110      Median :0         
##  Mean   : 0.2441   Mean   :  4836      Mean   :0         
##  3rd Qu.: 0.2800   3rd Qu.:  5944      3rd Qu.:0         
##  Max.   :10.0100   Max.   :108333      Max.   :0         
##  NA's   :64

Subsetting the data to view loans with monthly payments of $0, we can see that categories such as CreditScoreRangeLower and StatedMonthlyIncome are lower than the overall median. Interestingly, the ProsperScore is higher than average for this subset.

##  Term                    LoanStatus  BorrowerRate      ProsperScore  
##  12:8   Completed             :7    Min.   :0.05650   Min.   : 3.00  
##  36:0   FinalPaymentInProgress:1    1st Qu.:0.06248   1st Qu.: 7.25  
##  60:0   ChargedOff            :0    Median :0.07715   Median : 8.00  
##         Defaulted             :0    Mean   :0.09001   Mean   : 7.50  
##         Cancelled             :0    3rd Qu.:0.11865   3rd Qu.: 8.50  
##         Past Due (>120 days)  :0    Max.   :0.14580   Max.   :10.00  
##         (Other)               :0                                     
##       EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
##  Employed     :8       False:4             Min.   :740.0        
##  Not employed :0       True :4             1st Qu.:755.0        
##  No answer    :0                           Median :760.0        
##  Not available:0                           Mean   :767.5        
##  Other        :0                           3rd Qu.:780.0        
##  Retired      :0                           Max.   :820.0        
##  (Other)      :0                                                
##  CreditScoreRangeUpper OpenCreditLines AmountDelinquent DebtToIncomeRatio
##  Min.   :759.0         Min.   : 6.00   Min.   :0        Min.   :0.1100   
##  1st Qu.:774.0         1st Qu.: 7.75   1st Qu.:0        1st Qu.:0.1625   
##  Median :779.0         Median : 9.50   Median :0        Median :0.1800   
##  Mean   :786.5         Mean   :11.00   Mean   :0        Mean   :0.2137   
##  3rd Qu.:799.0         3rd Qu.:13.25   3rd Qu.:0        3rd Qu.:0.2550   
##  Max.   :839.0         Max.   :21.00   Max.   :0        Max.   :0.3900   
##                                                                          
##  StatedMonthlyIncome MonthlyLoanPayment
##  Min.   : 7292       Min.   :2112      
##  1st Qu.:12333       1st Qu.:2144      
##  Median :18125       Median :2159      
##  Mean   :16589       Mean   :2170      
##  3rd Qu.:20208       3rd Qu.:2189      
##  Max.   :25000       Max.   :2252      
## 

Subsetting the data to view loans with monthly payments over $2000, it seems that the borrower generally has a high stated monthly income, has no amount delinquent, is employed, and has a high credit score. Also, the terms are all 12 months and the borrower rates are all under 0.15. The borrowers in these loans appear to be “low risk” borrowers as the higher median Prosper Score indicates.

The Prosper Score appears to be a normal distribution. The plot above excludes any ‘NA’ values. The Prosper Score is also supposed to range from 1-10, so the scores of 11 were unexpected. I’m hesitant to treat a score of 11 as the best score since it may be an entry error. Let us subset the data to see if there is a trend in the loans with a Prosper Score of 11 and if the data is trustworthy.

##  Term                       LoanStatus    BorrowerRate      ProsperScore
##  12:   0   Current               :1419   Min.   :0.06050   Min.   :11   
##  36:1049   Completed             :  31   1st Qu.:0.06590   1st Qu.:11   
##  60: 407   FinalPaymentInProgress:   3   Median :0.08690   Median :11   
##            Defaulted             :   1   Mean   :0.09328   Mean   :11   
##            Past Due (61-90 days) :   1   3rd Qu.:0.10990   3rd Qu.:11   
##            Past Due (1-15 days)  :   1   Max.   :0.19500   Max.   :11   
##            (Other)               :   0                                  
##       EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
##  Employed     :1365    False:483           Min.   :640          
##  Self-employed:  43    True :973           1st Qu.:700          
##  Other        :  34                        Median :720          
##  Full-time    :  14                        Mean   :732          
##  Not employed :   0                        3rd Qu.:760          
##  No answer    :   0                        Max.   :840          
##  (Other)      :   0                                             
##  CreditScoreRangeUpper OpenCreditLines AmountDelinquent DebtToIncomeRatio
##  Min.   :659           Min.   : 2.00   Min.   :     0   Min.   :0.0200   
##  1st Qu.:719           1st Qu.: 8.00   1st Qu.:     0   1st Qu.:0.1300   
##  Median :739           Median :11.00   Median :     0   Median :0.1900   
##  Mean   :751           Mean   :11.32   Mean   :   298   Mean   :0.2007   
##  3rd Qu.:779           3rd Qu.:14.00   3rd Qu.:     0   3rd Qu.:0.2500   
##  Max.   :859           Max.   :38.00   Max.   :106071   Max.   :0.5800   
##                                                         NA's   :44       
##  StatedMonthlyIncome MonthlyLoanPayment
##  Min.   : 1200       Min.   :  43.47   
##  1st Qu.: 5417       1st Qu.: 243.56   
##  Median : 7500       Median : 402.31   
##  Mean   : 8850       Mean   : 424.04   
##  3rd Qu.:10833       3rd Qu.: 581.40   
##  Max.   :76667       Max.   :1162.33   
## 

The summary of loans with a Prosper Score of 11 do seem to indicate borrowers that are low risk. The median StatedMonthlyIncome and OpenCreditLines are higher than the overall median. I think this data can be trusted and used.

Borrower Rate has a fairly normal distribution, but there is a second peak around 0.31. The Prosper Score and Borrower Rate are both generated values, and I’m interested in exploring which variables affect these values.

Almost all of the borrowers are employed. A limitation of this dataset that can be recognized here is that the data is only for those who applied for a loan and were approved. The “riskier”" loan applicants (unemployed) that were denied loans are not a part of this dataset.

I also wonder if leaving the response blank, or the unclear responses of “Not available” and “Other” affect the borrower’s interest rate and Prosper score.

There is an even split between homeowners and non-homeowners. Does being a homeowner affect the Prosper Score? How about the borrower rate?

The stated monthly income is skewed to the right, but the log transformation produces a more normal distribution. There are very large outliers in the stated monthly income data so the top 1% of data was trimmed for the plot.

Similarly, a majority of the Debt to Income Ratios were between 0 and 1 with some values at the max of 10.01. The top 1% of data was trimmed for the plot above, showing a fairly normal distribution with some skew to the right. I want to take a look at the highest values of debt to income ratio.

##  Term                     LoanStatus   BorrowerRate     ProsperScore  
##  12:  1   Completed            :124   Min.   :0.0100   Min.   :1.000  
##  36:263   Defaulted            : 42   1st Qu.:0.1497   1st Qu.:3.000  
##  60:  8   Current              : 23   Median :0.1900   Median :4.500  
##           Past Due (61-90 days):  1   Mean   :0.2013   Mean   :4.478  
##           Past Due (16-30 days):  1   3rd Qu.:0.2699   3rd Qu.:6.000  
##           (Other)              :  0   Max.   :0.3500   Max.   :8.000  
##           NA's                 : 81                    NA's   :226    
##       EmploymentStatus IsBorrowerHomeowner CreditScoreRangeLower
##  Self-employed:74      False:145           Min.   :460.0        
##  Full-time    :65      True :127           1st Qu.:620.0        
##  Not available:37                          Median :680.0        
##  Employed     :36                          Mean   :668.8        
##  Not employed :27                          3rd Qu.:720.0        
##  Part-time    :17                          Max.   :860.0        
##  (Other)      :16                                               
##  CreditScoreRangeUpper OpenCreditLines  AmountDelinquent 
##  Min.   :479.0         Min.   : 0.000   Min.   :    0.0  
##  1st Qu.:639.0         1st Qu.: 5.000   1st Qu.:    0.0  
##  Median :699.0         Median : 8.000   Median :    0.0  
##  Mean   :687.8         Mean   : 8.694   Mean   :  734.9  
##  3rd Qu.:739.0         3rd Qu.:12.000   3rd Qu.:    0.0  
##  Max.   :879.0         Max.   :30.000   Max.   :37077.0  
##                        NA's   :43       NA's   :43       
##  DebtToIncomeRatio StatedMonthlyIncome MonthlyLoanPayment
##  Min.   :10.01     Min.   :    0.000   Min.   :   0.0    
##  1st Qu.:10.01     1st Qu.:    0.083   1st Qu.: 107.2    
##  Median :10.01     Median :    0.083   Median : 209.5    
##  Mean   :10.01     Mean   :  133.299   Mean   : 301.9    
##  3rd Qu.:10.01     3rd Qu.:    0.396   3rd Qu.: 388.6    
##  Max.   :10.01     Max.   :17083.333   Max.   :1047.6    
## 

Subsetting the data by Debt to Income Ratios greater than 10, we begin to see riskier Prosper scores, more Charged off or Defaulted loans, and much lower levels of full-time employment and stated monthly incomes.

## 
##     0   360   420   440   460   480   500   520   540   560   580   600 
##   133     1     5    36   141   346   554  1593  1474  1357  1125  3602 
##   620   640   660   680   700   720   740   760   780   800   820   840 
##  4172 12199 16366 16492 15471 12923  9267  6606  4624  2644  1409   567 
##   860   880 
##   212    27
## 
##    19   379   439   459   479   499   519   539   559   579   599   619 
##   133     1     5    36   141   346   554  1593  1474  1357  1125  3602 
##   639   659   679   699   719   739   759   779   799   819   839   859 
##  4172 12199 16366 16492 15471 12923  9267  6606  4624  2644  1409   567 
##   879   899 
##   212    27

The credit score variables both have normal distributions that look identical, but shifted on the x-axis. Looking at the tables of the values, we can see that the lower credit scores all end in 0 and the upper scores all end in 9. The scores also increment by a factor of 20 and the upper credit scores are all higher than the lower scores by exactly 19. Since these values represent a credit score range that seems to vary by exactly 19 for each borrower, we will create a new variable CreditScore that is the mean of CreditScoreRangeLower and CreditScoreRangeUpper.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##     9.5   669.5   689.5   695.1   729.5   889.5     591

According to ValuePenguin, the average credit score in the U.S. in 2014 was 693. Here we can see that the mean (695) and median (690) of the CreditScore in our data set are right around this value. I’m surprised to see here that the credit scores for those approved for loans is not higher than the average in the U.S. in 2014.

We can also see here that credit scores in our data set range from 9.5 to 889.5, but the actual range for credit scores is 300 to 850. Let’s explore some loans with credit scores outside the normal range.

##       Term LoanStatus BorrowerRate ProsperScore EmploymentStatus
## 795     36  Completed       0.2900           NA    Not available
## 913     36  Defaulted       0.2600           NA    Not available
## 1686    36       <NA>       0.1625           NA    Not available
## 4291    36       <NA>       0.2375           NA        No answer
## 4372    36  Defaulted       0.2325           NA    Not available
## 4431    36  Defaulted       0.1450           NA    Not available
## 4673    36       <NA>       0.1930           NA    Not available
## 5685    36  Completed       0.2300           NA    Not available
## 6878    36       <NA>       0.0700           NA    Not available
## 7740    36       <NA>       0.2900           NA    Not available
## 9049    36  Defaulted       0.2900           NA    Not available
## 9474    36       <NA>       0.2500           NA    Not available
## 13396   36  Defaulted       0.2375           NA        No answer
## 13526   36       <NA>       0.2000           NA    Not available
## 13687   36  Completed       0.1700           NA    Not available
## 13939   36  Defaulted       0.1700           NA    Not available
## 14339   36       <NA>       0.1700           NA    Not available
## 14726   36       <NA>       0.1700           NA    Not available
## 15094   36  Defaulted       0.2900           NA        No answer
## 15138   36       <NA>       0.2500           NA        No answer
##       IsBorrowerHomeowner CreditScoreRangeLower CreditScoreRangeUpper
## 795                 False                     0                    19
## 913                 False                     0                    19
## 1686                False                     0                    19
## 4291                False                     0                    19
## 4372                False                     0                    19
## 4431                False                     0                    19
## 4673                False                     0                    19
## 5685                False                     0                    19
## 6878                False                     0                    19
## 7740                False                     0                    19
## 9049                False                     0                    19
## 9474                False                     0                    19
## 13396                True                     0                    19
## 13526               False                     0                    19
## 13687               False                     0                    19
## 13939               False                     0                    19
## 14339               False                     0                    19
## 14726               False                     0                    19
## 15094               False                     0                    19
## 15138               False                     0                    19
##       OpenCreditLines AmountDelinquent DebtToIncomeRatio
## 795                NA               NA                NA
## 913                NA               NA           0.03000
## 1686               NA               NA                NA
## 4291               NA               NA           0.10368
## 4372               NA               NA                NA
## 4431               NA               NA                NA
## 4673               NA               NA           0.02000
## 5685               NA               NA           0.09000
## 6878               NA               NA                NA
## 7740               NA               NA           0.07000
## 9049               NA               NA           0.01000
## 9474               NA               NA           0.05000
## 13396              NA               NA           0.02393
## 13526              NA               NA                NA
## 13687              NA               NA                NA
## 13939              NA               NA           0.02000
## 14339              NA               NA                NA
## 14726              NA               NA           0.13000
## 15094              NA               NA           0.14000
## 15138              NA               NA                NA
##       StatedMonthlyIncome MonthlyLoanPayment CreditScore
## 795              115.5833             167.62         9.5
## 913             2316.6667              60.44         9.5
## 1686            1600.0000              35.28         9.5
## 4291            2916.6667             156.41         9.5
## 4372               0.0000              58.26         9.5
## 4431             166.6667              34.42         9.5
## 4673            2666.6667              44.17         9.5
## 5685            1083.3333              98.71         9.5
## 6878            2227.5000              30.88         9.5
## 7740            1400.0000             104.76         9.5
## 9049            3166.6667              41.91         9.5
## 9474            2000.0000             101.39         9.5
## 13396           4166.6667              99.71         9.5
## 13526           3896.6667             111.49         9.5
## 13687           2050.0000              35.65         9.5
## 13939           2083.3333              35.65         9.5
## 14339           1083.3333              42.78         9.5
## 14726            916.6667              57.04         9.5
## 15094           1200.0000              83.81         9.5
## 15138           1500.0000             119.28         9.5
##       Term LoanStatus BorrowerRate ProsperScore EmploymentStatus
## 186     36  Defaulted       0.2365            6    Self-employed
## 936     36  Completed       0.0499           NA        Full-time
## 1094    36    Current       0.0759            8         Employed
## 1402    36  Completed       0.0820           NA    Not available
## 1576    36    Current       0.0625           10         Employed
## 2024    36  Completed       0.3199            3    Self-employed
## 2117    36  Completed       0.0750           NA    Self-employed
## 2987    36  Completed       0.0695           NA          Retired
## 4083    36  Completed       0.0945            8        Full-time
## 4550    36  Completed       0.0854           10         Employed
## 6521    60    Current       0.0779           10         Employed
## 7183    36  Completed       0.0590           NA        Full-time
## 7449    12  Completed       0.0628            9          Retired
## 7475    36  Completed       0.1390            8         Employed
## 7835    36    Current       0.0649           10         Employed
## 8075    36  Completed       0.0830            8        Full-time
## 8731    36  Completed       0.0599           10         Employed
## 8852    36  Completed       0.0825           10         Employed
## 8993    36    Current       0.2049            7         Employed
## 10160   36  Completed       0.1559            5    Self-employed
##       IsBorrowerHomeowner CreditScoreRangeLower CreditScoreRangeUpper
## 186                  True                   860                   879
## 936                  True                   860                   879
## 1094                 True                   860                   879
## 1402                 True                   860                   879
## 1576                 True                   860                   879
## 2024                 True                   860                   879
## 2117                 True                   860                   879
## 2987                 True                   860                   879
## 4083                 True                   860                   879
## 4550                 True                   860                   879
## 6521                 True                   860                   879
## 7183                 True                   860                   879
## 7449                 True                   860                   879
## 7475                 True                   860                   879
## 7835                 True                   860                   879
## 8075                 True                   880                   899
## 8731                 True                   860                   879
## 8852                 True                   860                   879
## 8993                 True                   880                   899
## 10160                True                   860                   879
##       OpenCreditLines AmountDelinquent DebtToIncomeRatio
## 186                 5                0                NA
## 936                 4                0              0.00
## 1094                6                0              0.17
## 1402               NA               NA              0.02
## 1576                8                0              0.03
## 2024                7                0                NA
## 2117                7                0              0.03
## 2987                8                0              0.05
## 4083               12                0              0.15
## 4550                4                0              0.02
## 6521                7                0              0.08
## 7183               15                0              0.14
## 7449               13                0              0.13
## 7475               14                0              0.13
## 7835                7                0              0.06
## 8075                8                0                NA
## 8731                7                0              0.05
## 8852                5                0              0.03
## 8993                7                0              0.26
## 10160              10                0                NA
##       StatedMonthlyIncome MonthlyLoanPayment CreditScore
## 186              3333.333             234.30       869.5
## 936              9000.000              29.97       869.5
## 1094             6833.333             186.89       869.5
## 1402             5000.000              94.29       869.5
## 1576            17916.667             213.75       869.5
## 2024             6250.000             108.87       869.5
## 2117            18333.333              31.11       869.5
## 2987             1258.000              46.28       869.5
## 4083            10583.333             640.19       869.5
## 4550             8108.333              94.76       869.5
## 6521             8666.667             282.46       869.5
## 7183             7500.000             121.51       869.5
## 7449            10000.000             172.39       869.5
## 7475            32083.333             563.13       869.5
## 7835             8333.333             245.16       869.5
## 8075             6250.000             157.37       889.5
## 8731             7291.667             243.34       869.5
## 8852             9333.333             235.89       869.5
## 8993             2333.333             149.65       889.5
## 10160           10000.000             524.32       869.5

These loans do seem to be missing a fair amount of values (there are a lot of Prosper scores missing), but since the data here do contain values for many of the variables I will not remove these rows. I will remember to subset the data for any plots or calculations involving credit score to only include scores from 300 to 850.

The number of open credit lines is a fairly normal distribution. Most people have less than 20 open credit lines. The distribution of amount delinquent is, alternatively, clearly skewed to the right. Most people are not deliquent at all. A log transformation of the AmountDelinquent data produces a normal distribution.

Univariate Analysis

What is the structure of your dataset?

The data set being analyzed contains 113,937 loans with 13 variables (Term, LoanStatus, MonthlyLoanPayment, ProsperScore, BorrowerRate, EmploymentStatus, IsBorrowerHomeowner, StatedMonthlyIncome, DebtToIncomeRatio, CreditScoreRangeLower, CreditScoreRangeUpper, OpenCreditLines, and AmountDelinquent).

The original data set contained 81 variables, but in this project we are focusing on 13 variables of interest.

Some initial observations from the data are: - Most loans have a 36 month term - Borrower rate peaks around 0.15, but also again around 0.31 - The median upper credit score is about the same as the average U.S credit score - There are extreme values in the debt to income ratio at 10.01 - Some monthly loan payments are much higher than average - Some Prosper scores are an 11, outside the scale of 1-10

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is ProsperScore. I want to explore how the borrower profile such as EmploymentStatus, CreditScore, and StatedMonthlyIncome affect the ProsperScore. Also, I want to see how ProsperScore relates to the loan details such as BorrowerRate, MonthlyLoanPayment, and LoanStatus.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

As stated above, I think the variables relating to the borrower’s profile will all contribute to the ProsperScore. I think the CreditScore, StatedMonthlyIncome, and DebtToIncomeRatio will contribute the most. Further, I think the ProsperScore will correlate with the BorrowerRate and MonthlyLoanPayment, and will be predictive of the LoanStatus.

Did you create any new variables from existing variables in the dataset?

I created CreditScore by taking the average of the upper and lower credit scores, and I made Term a factor variable since the numerical values fell within 3 distinct categories. I also ordered the EmploymentStatus and LoanStatus factors based on my perception of best to worst from the view of the loan company.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I performed a log transformation on the StatedMonthlyIncome and AmountDelinquent since the data in both plots was skewed to the right. I wanted to see if the log transformation would normalize the data. The transformed data did create a more normal looking distribution for both variables.

Bivariate Plots Section

##                     BorrowerRate ProsperScore OpenCreditLines
## BorrowerRate                1.00        -0.66           -0.11
## ProsperScore               -0.66         1.00           -0.01
## OpenCreditLines            -0.11        -0.01            1.00
## AmountDelinquent            0.05        -0.04           -0.06
## DebtToIncomeRatio           0.13        -0.14            0.15
## StatedMonthlyIncome        -0.15         0.16            0.19
## MonthlyLoanPayment         -0.33         0.18            0.19
## CreditScore                -0.53         0.38            0.10
##                     AmountDelinquent DebtToIncomeRatio StatedMonthlyIncome
## BorrowerRate                    0.05              0.13               -0.15
## ProsperScore                   -0.04             -0.14                0.16
## OpenCreditLines                -0.06              0.15                0.19
## AmountDelinquent                1.00             -0.03                0.02
## DebtToIncomeRatio              -0.03              1.00               -0.14
## StatedMonthlyIncome             0.02             -0.14                1.00
## MonthlyLoanPayment             -0.02              0.00                0.30
## CreditScore                    -0.05             -0.01                0.10
##                     MonthlyLoanPayment CreditScore
## BorrowerRate                     -0.33       -0.53
## ProsperScore                      0.18        0.38
## OpenCreditLines                   0.19        0.10
## AmountDelinquent                 -0.02       -0.05
## DebtToIncomeRatio                 0.00       -0.01
## StatedMonthlyIncome               0.30        0.10
## MonthlyLoanPayment                1.00        0.22
## CreditScore                       0.22        1.00

Borrower rate has a moderate (to strong) negative correlation with Prosper score and credit score. There are also weak relationships between some of the other variables. As credit score goes up, Prosper score tends to go up. A higher monthly loan payment also weakly correlates with a higher stated monthly income and a lower borrower rate. Borrower rate seems to have moderately strong correlations with the most variables.

The Prosper score seems to be moderately correlated with borrower rate (negatively) and credit score. Credit score also seems to be moderately correlated with monthly loan payment and negatively with borrower rate.

There don’t seem to be any obvious relationships between Prosper score and any of the categorical variables. I also included borrower rate to see if any of the categorical variables correlated with it, but the relationships are weak.

I want to take a closer look at the relationships between Prosper score and the variables borrower rate, monthly loan payment, credit score, and debt to income ratio.

As Prosper score increases, borrower rate tends to decrease. At the low end and high end of the x-axis the range in borrower rate is lower. In the middle of the plot (Prosper score of 4-8) we can see a much bigger range in the borrower rate.

The slight upward trend in monthly loan payment as Prosper score increases can be seen here. It’s also fairly clear that as Prosper score increases, the biggest change to monthly loan payments is a wider range and greater volume of values. We don’t see many monthly loan payments greater than $500 for Prosper scores of 1-3, but the distribution of monthly loan payments for Prosper scores greater than 3 do not differ much as the score increases. Basically, borrowers with low Prosper scores do not get loans with high monthly payments.

Similarly to the borrower rate, a low Prosper score typically corresponds to a low credit score and a high Prosper score with a high credit score, but the middle range of Prosper scores correspond to a wide range of credit scores. In other words, it’s unlikely for a borrower with a low credit score to have a high Prosper score, and vice versa for a borrower with a high credit score. If the borrower has a middle range credit score (650-750), their Prosper score can range vastly.

Here we can see the slight negative correlation between debt to income ratio and Prosper score.

Let us also take a look at borrower rate vs. monthly loan payment and credit score as these variables had moderate correlations with borrower rate.

Changing the plot limits and adding transparency reveals clear sections of the plot where borrower rate increases linearly as the monthly loan payment increases. Also, the highest borrower rates tend to correspond to lower monthly payments. I wonder what may cause the borrower rate to increase without much or any increase to monthly loan payment.

The correlation between credit score and borrower rate is clear here - as credit score increases, borrower rate decreases.

Next, I’ll look at how Prosper score and borrower rate may fluctuate based on the categorical variables (loan status, employment, and homeowners). It is important to remember here that the loan status and employment status levels were ordered by my own thoughts on levels of borrower risk and ideal loan outcomes.

As mentioned above, it’s important to remember that the order of these levels was determined by my own judgment. There is a slight upward trend in Prosper score across loan status, but the defaulted loans Prosper score values seem to throw off this trend. There also may be a slight downward trend in borrower rate across loan status, but it is not strong at all.

Lower average Prosper scores can be seen for borrowers with employment statuses of not employed, other, and self-employed, while higher average scores correspond with retired, part-time and full-time. Based on this, I will begin treating employment status as a nominal variable. Trying to order the employment status categories does not seem feasible any longer.

The borrower rate is generally higher for those that are not employed, but otherwise there isn’t much difference across employment levels.

The borrower’s status as a homeowner does not appear to affect the Prosper score at all, but not owning a home does seem to correspond with a higher borrower rate.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The Prosper score and borrower rate had a moderate negative correlation, with the borrower rate decreasing as Prosper score increased. There was a big range of values in the middle section of Prosper scores (and the middle section of borrower rates).

Low Prosper scores make it highly unlikely that the monthly loan payment will be greater than $500.

A well-below average credit score generally corresponds to a low Prosper score, and a well-above average credit score generally corresponds to a high Prosper score. A credit score between 650-750, on the other hand, corresponds to Prosper scores that can range vastly.

Debt to income ratio increases tended to weakly correlate with lower Prosper scores.

There was a slight upward trend in Prosper score across loan status, but the defaulted loans Prosper score values seem to throw off this trend.

Lower average Prosper scores can be seen for borrowers with employment statuses of not employed, other, and self-employed, while higher average scores correspond with retired, part-time and full-time.

It was surprising to see that being a homeowner had no affect on Prosper score.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Borrower rate tends to decrease as credit score increases, which makes sense. I was also very interested in the plot between borrower rate and monthly loan payment. There were clear segments that showed sectional linear relationships between borrower rate and monthly loan payment. I’m wondering if the distinction of these sections corresponds to any other variables, or if the linear relationships are affected by another variable.

What was the strongest relationship you found?

The strongest bivariate relationship was between borrower rate and Prosper score. Credit score also had a fairly strong correlation with borrower rate.

Multivariate Plots Section

First I want to investigate the relationship above between borrower rate and monthly loan payment against the categorical variables and against Prosper score. This relationship really caught my eye and I want to see if there is more to learn about it.

It’s hard to pull any trends out of the data here. Most of the data contains borrowers who are employed or full-time. For loans with low borrower rates and mid-to-high monthly loan payments, its most common for the borrower to be employed.

Most loans are current, completed, or defaulted, but its hard to see any obvious trends.

Again, there is no obvious relationship here - homeowners may be slightly more likely to have higher monthly payments and lower borrower rates.

Here there is definitely a relationship, with borrower rate increasing as the Prosper score decreases while holding the monthly loan payment constant. It seems that the linear relationship pieces of the plot noted previously seem to correspond to decreases in Prosper score.

Let us move on and look at Prosper score and borrower rate against the categorical variables.

There are no new trends that can be observed from these plots.

Next I want to look at borrower rate and credit score against the categorical variables and also against Prosper score.

Low credit scores tend to correspond to defaulted loans as well as employment statuses of not employed or no answer.

High credit scores tend to correspond to homeowners, while low credit scores correspond to non-homeowners.

Borrowers with a higher Prosper score tend to have lower borrower rates, even if we hold credit score constant.

At first I had wanted to start out by exploring the Prosper score, but the data exploration has made me more interested in predicting the borrower rate. The plots colored by Prosper score show the moderately strong relationships borrower rate has with other variables in the dataset. I want to create a linear model to see how well we can predict borrower rate.

## 
## Calls:
## m1: lm(formula = BorrowerRate ~ ProsperScore, data = loans_subset_CreditScore)
## m2: lm(formula = BorrowerRate ~ ProsperScore + CreditScore, data = loans_subset_CreditScore)
## m3: lm(formula = BorrowerRate ~ ProsperScore + CreditScore + MonthlyLoanPayment, 
##     data = loans_subset_CreditScore)
## 
## ======================================================================
##                             m1              m2              m3        
## ----------------------------------------------------------------------
##   (Intercept)               0.317***        0.648***        0.630***  
##                            (0.001)         (0.003)         (0.003)    
##   ProsperScore             -0.020***       -0.017***       -0.016***  
##                            (0.000)         (0.000)         (0.000)    
##   CreditScore                              -0.000***       -0.000***  
##                                            (0.000)         (0.000)    
##   MonthlyLoanPayment                                       -0.000***  
##                                                            (0.000)    
## ----------------------------------------------------------------------
##   R-squared                 0.421           0.504           0.535     
##   adj. R-squared            0.421           0.504           0.535     
##   sigma                     0.057           0.052           0.051     
##   F                     61498.813       43106.032       32531.390     
##   p                         0.000           0.000           0.000     
##   Log-likelihood       122853.694      129468.834      132199.794     
##   Deviance                272.796         233.352         218.782     
##   AIC                 -245701.389     -258929.668     -264389.588     
##   BIC                 -245673.348     -258892.280     -264342.853     
##   N                     84714           84714           84714         
## ======================================================================

The model accounts for 53.5% of the variance in the borrower rate for a loan.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Holding monthly loan payments constant, higher Prosper scores generally equate to lower borrower rates. The same relationship can be seen when we hold credit scores constant.

These plots along with the correlations seen in the bivariate plots section led me to create a linear regression model with these variables.

Were there any interesting or surprising interactions between features?

I was mostly surprised to see the absence of any obvious trends across loan status, employment status, and homeownership (other than credit score).

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

For the amount of variables used in this exploration (compared to the 81 overall variables in the full data set), I think the linear model above is fairly good. With the complexities of all of the factors that go into a loan interest rate, I was surprised to see that Prosper score, credit score, and monthly loan payment alone account for 53.5% of the variance in the borrower rate.

The model obviously has its limitations since there are many other variables that likely affect affect borrower rate in the full data set (and outside it). The date the loan was taken out, the borrower’s state of residence, the type of loan, and many other variables likely have effects on the borrower rate. This is definitely not a perfect model, but I do find it relevant to see that three variables alone can account for over half of the variance in a loan rate.


Final Plots and Summary

Plot One

Description One

The borrower’s interest rate distribution is fairly normal, but contains an interesting peak around 0.31. The loan company may approve some ‘riskier’ borrowers for loans and give them higher interest rates in an attempt to make more money.

Plot Two

Description Two

Borrower rate and monthly loan payment have a very interesting relationship as seen here. There are clear linear trends within the plot. At lower monthly loan payments, the borrower rate can rise quickly for the same loan payment. As the monthly loan payments go up, borrower rate and loan payment increase together.

Plot Three

Description Three

Prosper score gives a better understanding to the relationship between borrower rate and monthly loan payment. Keeping the monthly loan payment constant, the borrower rate increases as Prosper score decreases (indicating a higher risk borrower). As the monthly loan payment increases, the loan company is taking on a higher risk if the borrower does not pay the money back since they will lose more money. For this reason, the loan company increases both the borrower rate and monthly loan payment as the Prosper score decreases. This way, the loan company is able to earn more interest from the borrower if they do pay the loan back. The extra revenue from interest makes it worth it for them to approve the loan with a higher loan payment for riskier borrowers.


Reflection

The full dataset used here contains almost 114,000 loans with 81 variables. There is a ton of information in the dataset, and in this exploration I decided to focus on 13 variables that peaked my interest.

I began by exploring the individual distributions of the variables in the dataset. Then, I looked at relationships between the variables and further explored the strongest trends. The exploration culminated in an interest in creating a linear model to predict borrower rate based on three of the variables.

Although I started off most interested in Prosper score, the trends among the variables I chose to explore made borrower rate the more intriguing variable. By choosing only 13 variables to explore from the full dataset, I absolutely introduced some limitations to the model. There are many other factors that I did not explore relating to the borrower rate, which is the reason the model only accounts for 53.5% of the variance in borrower rate.

Other limitations include the fact that this data all comes from Prosper. The relationships and models that come from this dataset may be totally different for loans at another company. The dataset also hasn’t been updated since 2014. Loan rates change frequently and so this data may not properly represent the current state of loans.

As mentioned above, there are 81 variables in the full dataset for each loan. I’d be interested in exploring the other variables in depth as well to see which other variables may contribute to borrower rate. With the full dataset, I also think a model could be built to predict the Prosper score. It would be interesting to see which factors go into the generation of a Prosper score and if a model could be built that could accurately predict what a borrower’s Prosper score would be.

>>>>>>> 762b36054f74711cc2af9a685fa21ebd3d3701ab